
add support to store packages/archives locally #1685


Open · wants to merge 34 commits into base: main

Conversation

@VarshaUN (Collaborator) commented Jun 10, 2025

Fixes #1683
Adds support for storing scanned packages/archives locally when the option is checked.

  1. As of now, ScanCode.io scans packages but has no option for storing the scanned packages/archives locally.
  2. This PR adds that feature: an optional checkbox lets users store packages locally, as shown in the figure below.
  3. As discussed with mentors, it was suggested to add this in general for all pipelines present in ScanCode.io.

Screenshot 2025-06-08 194209

Signed-off-by: Varsha U N <varshaun58@gmail.com>
@tdruez (Contributor) commented Jun 11, 2025

@VarshaUN Thanks for your work on this PR, it's a great start! That said, it currently lacks some important context. When submitting a PR, try to assume that the reviewer isn’t familiar with anything outside of what’s included here (like your proposal or earlier discussions).

To improve this, please start by clearly documenting:

  • The goals of the changes.
  • Any new concepts introduced.
  • How to set up and use the new features.
  • How this change fits into the larger project context.
  • Examples of use cases.
  • Known limitations or edge cases.
  • Potential side effects or breaking changes.

This will not only help with the current review process but will also serve as a strong foundation for future documentation improvements.

It would also be helpful to include:

  • A summary of what has been implemented so far.
  • Any remaining TODOs.
  • Open questions or design points that need further discussion.

Keep up the good work—looking forward to seeing this evolve!

@VarshaUN (Collaborator, Author) commented Jun 11, 2025

Sure, I have mentioned it above; if needed, I could share my proposal for context :)

Signed-off-by: Varsha U N <varshaun58@gmail.com>
@VarshaUN VarshaUN requested a review from pombredanne June 16, 2025 16:37
@AyanSinhaMahapatra (Member) left a comment

@VarshaUN thanks for the initial PR!

Please see my comments for your consideration. Presently your tests are failing, and we can't set up your branch locally, so I cannot do a detailed review of the functionality you added yet. Please fix this so we can proceed.

Could you also add some details of what you're trying to achieve in #1683 for better context as @tdruez has already requested?

Thanks, looks like a good start otherwise.

@@ -0,0 +1,28 @@
# Generated by Django 5.1.1 on 2025-05-26 09:19

Member:

Could you please merge your migrations into one file, since they are for the same fields?


from django.db import migrations, models


Member:

Could you please merge your migrations into one file, since they are for the same fields?
You can always generate multiple migration files as you go, but it's best to merge them and reorder based on merged branches before finally merging itself.

('download_date', models.DateTimeField(auto_now_add=True, help_text='Date when the package was downloaded or added.')),
('scan_log', models.TextField(blank=True, help_text='Log output from scanning the package.')),
('scan_date', models.DateTimeField(blank=True, help_text='Date when the package was scanned.', null=True)),
('project', models.ForeignKey(editable=False, on_delete=django.db.models.deletion.CASCADE, related_name='downloadedpackages', to='scanpipe.project')),
Member:

I'm not sure we want to put projects on_delete like this here.

('download_date', models.DateTimeField(auto_now_add=True, help_text='Date when the package was downloaded or added.')),
('scan_log', models.TextField(blank=True, help_text='Log output from scanning the package.')),
('scan_date', models.DateTimeField(blank=True, help_text='Date when the package was scanned.', null=True)),
('project', models.ForeignKey(editable=False, on_delete=django.db.models.deletion.CASCADE, related_name='downloadedpackages', to='scanpipe.project')),
Member:

We need to think about how we are going to handle the same package archive being used in two different projects with different pipelines, or scanned with different SCIO versions.

For example, the same package archive is scanned with inspect_packages and scan_single_package.

Additionally, we need to look into having a help text that lists the other projects which were run on the same package.

Consider this when you build the models, but we can also update them later as this is preliminary anyway.

@@ -50,6 +55,7 @@ def steps(cls):
cls.flag_empty_files,
cls.flag_ignored_resources,
cls.scan_for_application_packages,
cls.store_package_archives,
Member:

I'm not sure I get what you are doing here.

We need to do this processing potentially before we even download the inputs, so that we don't re-download them:

  • check for the URL in the index, and use that entry if present
  • otherwise download and compute the checksum, check the checksum index, and use that entry if present

Additionally, we also need to consider whether we want to rescan based on having a newer version of ScanCode or running different pipelines.

Presently you are running this well after we download and scan things.
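The check-before-download ordering described above can be sketched as follows. This is an illustrative sketch, not ScanCode.io code; `ArchiveIndex`, `resolve`, and `fetch` are hypothetical names:

```python
# Sketch of the suggested ordering: consult the archive index *before*
# downloading an input, so identical content is never fetched twice.
from hashlib import sha256


class ArchiveIndex:
    """In-memory stand-in for the by-URL and by-checksum indexes."""

    def __init__(self):
        self.by_url = {}       # download_url -> sha256
        self.by_checksum = {}  # sha256 -> content bytes

    def resolve(self, url, fetch):
        """Return (sha256, content), reusing archived content when possible."""
        # 1. Check for the URL in the index and use that entry if present.
        if url in self.by_url:
            checksum = self.by_url[url]
            return checksum, self.by_checksum[checksum]
        # 2. Otherwise download, compute the checksum, and check the
        #    checksum index before storing a new blob.
        content = fetch(url)
        checksum = sha256(content).hexdigest()
        if checksum not in self.by_checksum:
            self.by_checksum[checksum] = content
        self.by_url[url] = checksum
        return checksum, content
```

With this shape, the second resolution of the same URL never calls `fetch` again.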

VarshaUN and others added 16 commits June 28, 2025 17:56
Signed-off-by: Varsha U N <varshaun58@gmail.com>
…-org#1686)

* Upgrade packageurl-python to latest version aboutcode-org#1383

Signed-off-by: tdruez <tdruez@nexb.com>

* Add make_mock_response to simplify setup in unit test aboutcode-org#1383

Signed-off-by: tdruez <tdruez@nexb.com>

* Add support for fetching Package URLs (fetch_package_url) aboutcode-org#1383

Signed-off-by: tdruez <tdruez@nexb.com>

* Add Package URL placeholder in InputsBaseForm aboutcode-org#1383

Signed-off-by: tdruez <tdruez@nexb.com>

* Add CHANGELOG entry aboutcode-org#1383

Signed-off-by: tdruez <tdruez@nexb.com>

---------

Signed-off-by: tdruez <tdruez@nexb.com>
Signed-off-by: Varsha U N <varshaun58@gmail.com>
Signed-off-by: tdruez <tdruez@nexb.com>
* Upgrade Ace library to latest version 1.42.0

Signed-off-by: tdruez <tdruez@nexb.com>

* Upgrade Bulma CSS library to latest version 1.0.4

Signed-off-by: tdruez <tdruez@nexb.com>

* Refine the CSS for the Resource viewer

Signed-off-by: tdruez <tdruez@nexb.com>

---------

Signed-off-by: tdruez <tdruez@nexb.com>
…1693)

* Display matched snippets details in "Resource viewer" aboutcode-org#1688

Signed-off-by: tdruez <tdruez@nexb.com>

* Remove print statements used for debugging aboutcode-org#1688

Signed-off-by: tdruez <tdruez@nexb.com>

---------

Signed-off-by: tdruez <tdruez@nexb.com>
aboutcode-org#1694)

In preparation of adding parent_path as a field aboutcode-org#1691

Signed-off-by: tdruez <tdruez@nexb.com>
Signed-off-by: tdruez <tdruez@nexb.com>
* Add d2d symbols matching for winpe macho binaries

Reference: aboutcode-org#1431
Reference: aboutcode-org#1432
Reference: aboutcode-org#1433

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Use newly released source-inspector v0.6.0

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Bump binary-inspector to v0.1.2

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

* Add test as examples for macho/winpe symbol matching

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>

---------

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Signed-off-by: tdruez <tdruez@nexb.com>
Signed-off-by: Varsha U N <varshaun58@gmail.com>
VarshaUN and others added 13 commits July 1, 2025 17:51
Co-authored-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Signed-off-by: Varsha U N <varshaun58@gmail.com>
Signed-off-by: Varsha U N <varshaun58@gmail.com>
Signed-off-by: Varsha U N <varshaun58@gmail.com>
Signed-off-by: Varsha U N <varshaun58@gmail.com>
Signed-off-by: Varsha U N <varshaun58@gmail.com>
Signed-off-by: Varsha U N <varshaun58@gmail.com>
Signed-off-by: Varsha U N <varshaun58@gmail.com>
Signed-off-by: Varsha U N <varshaun58@gmail.com>
Signed-off-by: Varsha U N <varshaun58@gmail.com>
Signed-off-by: Varsha U N <varshaun58@gmail.com>
Signed-off-by: Varsha U N <varshaun58@gmail.com>
@pombredanne (Member) left a comment

Thanks! Here are the results from the review.


# Package storage settings

ENABLE_LOCAL_PACKAGE_STORAGE = env.bool("ENABLE_LOCAL_PACKAGE_STORAGE", default=False)
Member:

Based on our review session, what about using this instead?

ENABLE_DOWNLOAD_ARCHIVING = env.bool("ENABLE_DOWNLOAD_ARCHIVING", default=False)
# Providers: "localstorage", "s3"
DOWNLOAD_ARCHIVING_PROVIDER = env.str("DOWNLOAD_ARCHIVING_PROVIDER", default=None)
# For local storage, the root path would be stored in this setting
DOWNLOAD_ARCHIVING_PROVIDER_CONFIGURATION = env.dict("DOWNLOAD_ARCHIVING_PROVIDER_CONFIGURATION", default=None)
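For illustration, a hypothetical `.env` entry using these proposed settings might look like this (values are invented, not from the PR):

```ini
ENABLE_DOWNLOAD_ARCHIVING=True
# one of: localstorage, s3
DOWNLOAD_ARCHIVING_PROVIDER=localstorage
# parsed by env.dict() as key=value pairs
DOWNLOAD_ARCHIVING_PROVIDER_CONFIGURATION=root_path=/var/scancodeio/archives
```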

@@ -54,6 +55,8 @@
path("", RedirectView.as_view(url="project/")),
]

urlpatterns += static(settings.MEDIA_URL, document_root=settings.MEDIA_ROOT)
Member:

Do not use Django media for our storage; instead, we are using our own mechanism.

@@ -178,6 +179,12 @@ def __init__(self, *args, **kwargs):
pipeline_choices = scanpipe_app.get_pipeline_choices(include_addon=False)
self.fields["pipeline"].choices = pipeline_choices

self.fields["use_local_storage"].label = "Store packages locally"
Member:

Leave forms for later

@@ -585,6 +585,7 @@ class Project(UUIDPKModel, ExtraDataFieldMixin, UpdateMixin, models.Model):
)
notes = models.TextField(blank=True)
settings = models.JSONField(default=dict, blank=True)

Member:

Based on the latest review, we do not need models yet.

@@ -20,9 +20,15 @@
# ScanCode.io is a free software code scanning tool from nexB Inc. and others.
# Visit https://github.com/aboutcode-org/scancode.io for support and download.

import logging
Member:

Let's use instead a modification of the super class, and not a modification to each of the pipelines.

@@ -20,30 +20,36 @@
# ScanCode.io is a free software code scanning tool from nexB Inc. and others.
# Visit https://github.com/aboutcode-org/scancode.io for support and download.


Member:

Do not change this for now. Instead, create a new archiving.py module where you can selectively import code from fetch.py.

@@ -57,6 +57,11 @@ <h2 class="subtitle mb-0 mb-4">
<label class="label" for="{{ form.pipeline.id_for_label }}">
{{ form.pipeline.label }}
</label>
<label class="checkbox" for="{{ form.use_local_storage.id_for_label }}"
Member:

Remove this. No forms just yet

Member:

Please do not commit a binary for your tests. A small one-line text file will work just as well.

@@ -38,6 +40,7 @@
from django.conf import settings
from django.contrib.auth import get_user_model
from django.core.exceptions import ValidationError
from django.core.files import File
Member:

Move the tests to ones that test putting and getting the files, in test_archiving.py instead.

@@ -0,0 +1,93 @@
# Generated by Django 5.1.1 on 2025-07-09

import django.db.models.deletion
Member:

No model change so no migrations.

@pombredanne (Member) commented:
Here are the notes from our review call:

The overall design idea would be to have small JSON files stored side-by-side with the archived inputs.

The content would be stored by hash like today, and we would have some extra metadata files, one for each origin (download URL, date, and filename).

We could have a simple base class to get/put files in the archive, and a local file system implementation for now, enabled with a global setting.

# The backend would need to have these objects and functions:

class Download:
    sha256: str
    download_date: str
    download_url: str
    filename: str


class DownloadStore:

    def list(self):
        """Return an iterable of all the stored downloads."""

    def get(self, sha256_checksum: str):
        """Return a Download object for this checksum, or None."""

    def put(self, content: bytes, download_url: str, download_date: str, filename: str):
        """Store content with its metadata. Return a Download object on
        success. Raise an exception on error."""

    def find(self, download_url: str = None, filename: str = None, download_date: str = None):
        """Return a Download object matching the given criteria, or None."""


class LocalFilesystemProvider(DownloadStore):
    """Use local file system storage for downloads."""
    root_path: Path  # absolute path; this would be a global setting for an SCIO instance


class S3LikeProvider(DownloadStore):
    """Use an S3 bucket for download storage."""
    bucket_name: str
    aws_userid: str
    aws_apikey: str
    other_aws_credentials: str


class SftpProvider(DownloadStore):
    """Use an SFTP/SSH account for download storage."""
    host: str
    root_path: str
    ssh_credentials: str
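To make the sketch concrete, here is a minimal, runnable take on the local file system provider, using the sharded by-checksum layout and origin-hash naming from these notes. It is illustrative only, not actual ScanCode.io code:

```python
# Minimal sketch of a LocalFilesystemProvider per the review notes:
# content stored once per sha256, origin metadata stored as JSON alongside.
import json
from dataclasses import dataclass
from hashlib import sha256
from pathlib import Path


@dataclass
class Download:
    sha256: str
    download_date: str
    download_url: str
    filename: str


class LocalFilesystemProvider:
    def __init__(self, root_path):
        self.root_path = Path(root_path)

    def _content_dir(self, checksum):
        # e.g. 59/4c/67/807fb.../ for checksum "594c67807fb..."
        return self.root_path / checksum[0:2] / checksum[2:4] / checksum[4:6] / checksum[6:]

    def put(self, content, download_url, download_date, filename):
        """Store content with its metadata and return a Download object."""
        checksum = sha256(content).hexdigest()
        content_dir = self._content_dir(checksum)
        content_dir.mkdir(parents=True, exist_ok=True)
        (content_dir / "content").write_bytes(content)
        origin = {
            "sha256": checksum,
            "filename": filename,
            "download_date": download_date,
            "url": download_url,
        }
        # Origin hash computed from filename, date, and URL, as in the notes.
        origin_hash = sha256(f"{filename}{download_date}{download_url}".encode()).hexdigest()
        (content_dir / f"origin-{origin_hash}.json").write_text(json.dumps(origin, indent=2))
        return Download(checksum, download_date, download_url, filename)

    def get(self, sha256_checksum):
        """Return a Download for this checksum (first origin found), or None."""
        content_dir = self._content_dir(sha256_checksum)
        if not content_dir.exists():
            return None
        for origin_file in sorted(content_dir.glob("origin-*.json")):
            data = json.loads(origin_file.read_text())
            return Download(data["sha256"], data["download_date"], data["url"], data["filename"])
        return None
```

Putting the same bytes twice with different origins would reuse the one content blob and add a second origin JSON file.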

The files would be stored in a layout keyed by checksum, with immutable metadata.

The local storage looks like this:

<checksum>/
    content: the actual unique blob for that checksum
    <origin-hash>.json: a unique JSON file for one origin
    <origin-hash>.json: a unique JSON file for another origin
    ...

or an example:

# Stored locally at:

59/4c/67/807fb16238b30c44bdf74f36c02cdf22d1c8cda91ef8a0ed8dabf5620a/
    content
    origin-234682346.json  # here 234682346 is a computed hash based on filename, download_date, and url
    {
      "sha256": "594c67807fb16238b30c44bdf74f36c02cdf22d1c8cda91ef8a0ed8dabf5620a",
      "filename": "MarkupSafe-2.0.1.tar.gz",
      "download_date": "2025-03-12-13:10",
      "url": "https://files.pythonhosted.org/packages/bf/10/MarkupSafe-2.0.1.tar.gz"
    }
    origin-23sdsdf4682346.json
    {
      "sha256": "594c67807fb16238b30c44bdf74f36c02cdf22d1c8cda91ef8a0ed8dabf5620a",
      "filename": "MarkupSafe-2.2.1.tar.gz",
      "download_date": "2025-03-15-13:10",
      "url": "https://files.pypi.org/MarkupSafe-2.2.1.tar.gz"
    }
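The sharded directory name in the example above can be derived from the checksum with a few slices; `shard_path` is a hypothetical helper name, not ScanCode.io code:

```python
def shard_path(checksum):
    """Split a hex checksum into three 2-char shards plus the remainder,
    e.g. "594c67807f..." -> "59/4c/67/807f..."."""
    return "/".join([checksum[0:2], checksum[2:4], checksum[4:6], checksum[6:]])
```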

Next step: where and when to store/archive downloads?

  • IN STEP: add a step to all relevant pipelines: this requires updating all the pipelines. The current design. Problematic!

  • AS PIPELINE: use an add-on pipeline: some pipelines will not have downloads, and it could happen before or after inputs are processed. Problematic!

  • CORE: have a feature in the base Pipeline class and settings to enable it:

    • global setting: downloads are always archived
    • per project: downloads are archived if a checkbox/flag has been set for the project
    • per input: downloads are archived if an "archive" tag has been set for an input

How to compute an origin file path?


>>> from hashlib import sha256
>>> import json
>>> content = b"serfsrfserwe"
>>> content_sha256 = sha256(content).hexdigest()
>>> # create paths for storage
>>> # write the content blob
>>> # write the metadata
>>> dnl = {
...     "sha256": "594c67807fb16238b30c44bdf74f36c02cdf22d1c8cda91ef8a0ed8dabf5620a",
...     "filename": "MarkupSafe-2.0.1.tar.gz",
...     "download_date": "2025-03-12-13:10",
...     "url": "https://files.pythonhosted.org/packages/bf/10/MarkupSafe-2.0.1.tar.gz"
... }
>>> to_hash = f"{dnl['filename']}{dnl['download_date']}{dnl['url']}".encode()
>>> origin_filename = sha256(to_hash).hexdigest()
>>> with open(f"origin-{origin_filename}.json", "w") as output:
...     json.dump(dnl, output, indent=2)
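Given that layout, the `find()` method from the DownloadStore sketch could be implemented by scanning the stored origin JSON files; `find_download` is a hypothetical helper for illustration, not ScanCode.io code:

```python
# Scan origin-*.json files under the storage root and return the first
# origin whose fields match all of the given criteria.
import json
from pathlib import Path


def find_download(root_path, download_url=None, filename=None, download_date=None):
    """Return the first matching origin dict, or None."""
    for origin_file in Path(root_path).rglob("origin-*.json"):
        origin = json.loads(origin_file.read_text())
        if download_url is not None and origin.get("url") != download_url:
            continue
        if filename is not None and origin.get("filename") != filename:
            continue
        if download_date is not None and origin.get("download_date") != download_date:
            continue
        return origin
    return None
```

A linear scan like this is fine for a sketch; a real provider would likely keep an index rather than walk the tree on every lookup.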

Labels
None yet
Development

Successfully merging this pull request may close these issues.

Add support to store scanned-packages/archives locally
4 participants